RoboBench: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models as Embodied Brain

Accepted to ECCV 2026

Yulin Luo¹*, Chun-Kai Fan¹*, Menghang Dong¹*, Jiayu Shi¹*, Xiangju Mi¹*, Mengdi Zhao³*†, Bo-Wen Zhang⁴*, Cheng Chi²*†, Jiaming Liu¹, Gaole Dai¹, Rongyu Zhang¹, Ruichuan An¹, Kun Wu⁵, Zhengping Che⁵, Shaoxuan Xie², Guocai Yao², Zhongxia Zhao¹,², Pengwei Wang², Guang Liu², Zhongyuan Wang², Tiejun Huang¹,², Shanghang Zhang¹,²✉

¹ State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
² Beijing Academy of Artificial Intelligence
³ Institute for Brain and Intelligence, Fudan University
⁴ University of Science and Technology Beijing
⁵ Beijing Innovation Center of Humanoid Robotics
* Equal contribution, † Project leader, ✉ Corresponding author

Overview of RoboBench. We evaluate MLLMs as embodied brains across 5 dimensions, 14 capabilities, and 25 tasks, with tasks color-coded by dimension (top left). These dimensions follow the embodied execution pipeline (bottom)—from understanding intent, perceiving the environment, planning and adapting actions, refining subgoals via affordances, to diagnosing failures—capturing the core cognitive roles of System 2. Performance comparison (top middle) reveals significant gaps among state-of-the-art MLLMs. (top right) Perception-focused RoboBench scores show strong associations with downstream CALVIN performance in our VLM-VLA analysis.

Abstract

Building robots that can perceive, reason, and act in dynamic, unstructured environments remains a core challenge. Recent embodied systems often adopt a dual-system paradigm, where System 2 handles high-level reasoning while System 1 executes low-level control. In this work, we refer to System 2 as the embodied brain, emphasizing its role as the cognitive core for reasoning and decision-making in manipulation tasks. Given this role, systematic evaluation of the embodied brain is essential for advancing robotic intelligence. Yet existing benchmarks emphasize execution success, or, when targeting high-level reasoning, suffer from incomplete dimensions and limited task realism, offering only a partial picture of cognitive capability. To bridge this gap, we introduce RoboBench, a benchmark that systematically evaluates multimodal large language models (MLLMs) as embodied brains. Motivated by the distinct cognitive roles required across the full manipulation pipeline, RoboBench defines five dimensions—Instruction Comprehension, Perception Reasoning, Generalized Planning, Affordance Prediction, and Failure Analysis—spanning 14 capabilities, 25 tasks, and 6092 QA pairs. To ensure realism, we curate datasets across diverse embodiments, attribute-rich objects, multi-view scenes, and memory-driven navigation, drawing from large-scale real robotic data and in-house collection. For planning, RoboBench introduces an evaluation framework that uses an MLLM as a world simulator. It moves beyond symbolic matching to evaluate embodied feasibility by simulating whether predicted plans can achieve critical object-state changes under physical and visual constraints, enabling structured assessment of long-horizon reasoning. Experiments on 18 state-of-the-art MLLMs reveal persistent limitations: difficulties with implicit instruction comprehension, spatiotemporal reasoning, cross-scenario planning, fine-grained affordance understanding, and execution failure diagnosis. 
We further analyze how embodied cognitive abilities relate to downstream robotic control. RoboBench provides a comprehensive scaffold to quantify high-level cognition, clarify the role of the embodied brain, and inform the development of next-generation MLLMs for more robust robotic intelligence.

News

📊 2026.07 - Congratulations to HY-Embodied! HY-Embodied-0.5 MoT-2B and Hy-Embodied-VLM-1.0 A3B include RoboBench-MCQ and RoboBench-Planning as part of their official evaluation suite, highlighting RoboBench as a recognized benchmark for embodied foundation models.
🎉 2026.07 - RoboBench is accepted to ECCV 2026! This page now reports the official ECCV 2026 leaderboard, covering 18 state-of-the-art MLLMs (GPT-5.4 / GPT-5, Claude-Opus-4.7 / Sonnet-4.6, Gemini-3.1-Pro, Qwen3-VL, MiMo-Embodied, RoboBrain-2.5, and more), evaluated with the MLLM-as-world-simulator planning framework.
📰 2026.07 - RoboBench is featured by 具身智能之心, introducing our benchmark for evaluating MLLMs as embodied brains.
🔥 2025.10.23 - Dataset and code have been released! If you encounter any problems, please open an issue on GitHub, and we will look into it as soon as possible❗️
🔥 2025.10.21 - The paper has been released! Code and dataset are being organized and will be released soon. Stay tuned❗️

Highlights

🔍 Benchmark Overview

RoboBench systematically evaluates MLLMs as embodied brains rather than only measuring final robot execution success.
The benchmark spans 5 core dimensions, 14 capabilities, 25 task types, and 6,092 high-quality QA pairs.

🧭 Embodied Execution Pipeline

Tasks cover the full manipulation reasoning pipeline: instruction comprehension, perception reasoning, generalized planning, affordance prediction, and failure analysis.

🛡️ Planning Beyond Text Matching

Q1 (long-horizon planning) is evaluated with a DAG-guided MLLM-as-world-simulator that checks action alignment and task completion under physical and visual constraints.
Q2 (next-step planning) and Q3 (task-state estimation) further test individual steps, making the planning evaluation interpretable at the step level.

🧠 Real-world Data

Data are curated from open-source real-robot datasets and in-house collection across diverse embodiments, objects, views, and navigation scenarios.
Dimension-specific validation and human review keep QA items grounded in the visible scene and executable task context.

🌍 Broad Model Audit and Downstream Signal

The official leaderboard covers 18 state-of-the-art closed-source, open-source, and embodied MLLMs, plus a text-only ablation.
Selected RoboBench scores show benchmark-dependent associations with downstream VLA performance, offering diagnostic signals for robot-control evaluation.

Demo Case

Demo case. A representative RoboBench item illustrates how a robot scene is converted into grounded questions and evaluated across embodied reasoning skills, including perception, planning, affordance prediction, and failure analysis.

Leaderboard

	Perception Reasoning
	Robotic-centric		Object-centric		Scene-centric			Task-centric
Model	Robot-type	Robot-view	Static Attr.	Functional Attr.	Spatial Relation	Temp. Grounding	Causality	Refer. Comprehen.	Avg
Basic Reference
Human Evaluation	80.67	79.08	43.77	83.89	70.91	51.61	91.22	93.22	74.30
GPT-5.4-text-only	25.86	28.26	8.81	45.57	32.67	22.90	34.48	18.40	27.12
Closed-Source MLLMs
GPT-5.4	73.28	50.00	42.86	73.42	54.46	38.93	45.52	71.17	56.20
GPT-5.2	68.10	39.86	38.60	77.22	47.52	30.53	54.48	71.17	53.44
GPT-5	64.66	47.10	49.24	69.62	54.46	48.09	74.48	78.53	60.77
GPT-4.1	66.38	50.00	40.43	68.35	47.52	22.14	56.55	73.01	53.05
GPT-4o	75.00	39.13	18.24	60.76	49.50	22.14	43.45	55.21	45.43
Claude-Opus-4.7	76.72	53.62	57.14	81.01	48.51	46.56	65.52	71.78	62.61
Claude-Sonnet-4.6	53.45	47.83	53.80	69.62	52.48	29.01	57.24	69.33	54.09
Claude-Sonnet-4.5	46.55	33.33	37.08	72.15	48.51	33.59	51.72	36.81	44.97
Claude-Haiku-4.5	44.83	33.33	30.70	56.96	25.74	22.14	45.52	27.61	35.85
Gemini-3.1-Pro	71.55	49.28	66.26	78.48	61.90	31.93	88.97	90.18	67.32
Gemini-2.5-Pro	67.24	43.48	57.14	82.28	57.43	50.38	73.10	80.37	63.93
Gemini-2.5-Flash	66.38	34.78	57.75	74.68	55.45	34.92	75.17	76.69	59.48
Open-Source Multi-Image MLLMs
Qwen3-VL-8B	52.59	36.96	27.66	65.82	36.63	25.95	31.72	54.60	41.49
Qwen2.5-VL-7B-Ins	37.07	23.19	24.32	56.96	26.73	22.14	33.10	34.36	32.23
LLaVA-OneVision-7B	31.03	26.81	39.21	68.35	42.57	18.32	33.79	50.92	38.88
Embodied MLLMs
RoboBrain-2.0-7B	31.90	19.57	28.57	44.30	34.65	21.37	24.83	33.13	29.79
RoboBrain-2.5-4B	35.34	24.64	39.82	77.22	53.47	18.32	59.31	44.17	44.04
MiMo-Embodied-7B	25.86	21.74	32.22	65.82	49.50	24.43	57.93	43.56	40.13

	Instruction Comprehension			Generalized Planning (Q1)
				Cross-Embodiment Planning				Cross-Object Planning			Cross-View Planning		Cross-Task Planning
Model	Explicit	Implicit	Avg	Single-arm	Dual-arm	Mobile-manip.	Human	Material Afford.	Physical Attr.	World Knowl.	Multi	Single	Navigation Plan.	Avg
Basic Reference
Human Evaluation	59.94	61.13	60.54	72.50	41.93	41.55	62.28	56.70	58.98	49.36	52.82	51.59	45.23	54.50
GPT-5.4-text-only	74.88	38.54	56.71	83.53	66.47	74.71	62.65	76.03	80.36	66.95	71.33	72.41	47.66	73.74
Closed-Source MLLMs
GPT-5.4	74.58	50.80	62.69	85.56	70.28	75.59	56.26	78.64	64.83	72.44	73.33	72.50	53.92	70.91
GPT-5.2	75.90	48.85	62.38	86.92	70.00	80.38	60.08	81.60	64.62	75.49	75.84	71.22	55.03	72.31
GPT-5	77.63	54.71	66.17	84.25	69.48	81.27	70.13	81.58	59.80	71.95	70.95	72.76	58.29	71.84
GPT-4.1	76.23	57.30	66.77	88.08	68.56	78.17	55.38	81.37	65.76	64.88	73.70	71.83	52.94	71.79
GPT-4o	74.22	54.90	64.56	86.02	66.40	76.44	62.31	80.86	72.63	73.78	64.50	65.63	57.62	73.46
Claude-Opus-4.7	73.92	61.52	67.72	89.43	71.98	84.30	68.53	84.56	67.70	70.00	75.06	72.06	59.53	75.54
Claude-Sonnet-4.6	79.94	61.38	70.66	88.89	73.93	84.42	66.19	84.76	79.10	73.54	80.55	77.90	67.06	79.38
Claude-Sonnet-4.5	76.65	53.62	65.13	89.11	75.06	81.70	64.27	85.10	69.70	76.22	82.17	75.06	59.52	76.62
Claude-Haiku-4.5	73.78	42.88	58.33	86.13	74.07	76.63	60.38	81.37	58.93	71.75	78.48	71.73	50.87	71.01
Gemini-3.1-Pro	73.25	59.90	66.58	80.64	69.85	79.90	50.13	73.11	71.66	69.63	74.64	74.39	57.14	70.71
Gemini-2.5-Pro	76.20	60.96	68.58	83.53	69.08	84.31	59.16	76.72	66.93	77.68	73.57	75.24	55.34	71.50
Gemini-2.5-Flash	71.45	49.90	60.67	83.58	69.41	81.06	58.43	75.72	70.76	74.88	72.14	72.65	55.08	70.98
Open-Source Multi-Image MLLMs
Qwen3-VL-8B	59.46	30.80	45.13	74.49	44.54	57.98	52.25	63.88	54.66	55.49	49.15	58.75	37.22	56.71
Qwen2.5-VL-7B-Ins	56.04	23.90	39.97	73.89	30.19	56.06	53.85	58.90	57.78	53.90	25.83	37.50	11.95	49.92
LLaVA-OneVision-7B	38.25	10.61	24.43	54.87	31.05	35.88	43.99	37.59	51.37	30.00	31.43	36.60	25.11	41.02
Embodied MLLMs
RoboBrain-2.0-7B	43.54	21.10	32.32	62.49	30.16	44.42	42.90	46.62	52.87	45.24	31.25	32.69	25.98	45.12
RoboBrain-2.5-4B	36.30	16.65	26.47	39.32	23.99	45.87	54.53	31.69	29.16	24.39	28.39	25.75	23.97	31.85
MiMo-Embodied-7B	66.87	37.30	52.09	82.20	37.11	61.76	63.03	73.05	66.85	70.88	58.95	43.88	28.71	62.72

	Instruction Comprehension		Generalized Planning
	Explicit Goal		Single Arm		Material Afford.		World Knowl.
Model	Q2	Q3	Q2	Q3	Q2	Q3	Q2	Q3
Basic Reference
Human Evaluation	45.28	74.32	27.52	71.35	43.62	71.20	43.89	69.83
GPT-5.4-text-only	36.98	46.25	40.98	52.86	40.43	52.33	43.01	41.46
Closed-Source MLLMs
GPT-5.4	49.48	62.50	48.20	67.85	44.38	64.67	42.19	51.22
GPT-5.2	39.32	75.00	42.86	73.02	41.07	66.67	37.50	56.10
GPT-5	44.09	72.97	47.26	75.75	44.38	62.83	39.58	63.41
GPT-4.1	45.31	70.00	48.62	63.76	44.32	63.67	39.58	58.54
GPT-4o	42.86	65.00	43.94	59.95	41.23	55.33	42.19	51.22
Claude-Opus-4.7	44.53	67.50	51.08	65.12	45.97	63.00	45.31	63.41
Claude-Sonnet-4.6	43.75	70.00	43.22	66.21	43.90	64.33	39.58	60.98
Claude-Sonnet-4.5	41.67	61.25	44.73	56.68	41.99	54.00	41.15	48.78
Claude-Haiku-4.5	34.13	62.50	43.07	60.38	38.14	63.18	31.77	70.73
Gemini-3.1-Pro	44.79	71.25	50.65	64.58	46.97	67.67	26.56	65.85
Gemini-2.5-Pro	37.76	72.50	52.02	71.66	49.20	70.00	35.42	68.29
Gemini-2.5-Flash	47.66	77.50	49.85	54.22	44.81	68.83	45.31	68.29
Open-Source Multi-Image MLLMs
Qwen3-VL-8B	49.74	62.50	50.94	61.58	44.80	55.50	36.98	56.10
Qwen2.5-VL-7B-Ins	29.43	55.00	31.88	52.59	30.17	49.33	20.31	51.22
LLaVA-OneVision-7B	33.59	41.25	35.06	46.05	35.97	40.50	34.90	43.90
Embodied MLLMs
RoboBrain-2.0-7B	34.90	52.50	33.84	53.95	34.29	49.00	27.60	48.78
RoboBrain-2.5-4B	32.81	58.75	37.37	54.50	35.25	55.17	32.29	56.10
MiMo-Embodied-7B	38.17	55.00	43.87	55.31	42.90	51.50	46.88	63.41

	Affordance Prediction				Failure Analysis
Model	Static	Dynamic	Navigation	Avg	Execution	Planning	Avg
Basic Reference
Human Evaluation	86.08	80.02	81.85	82.63	47.30	80.67	63.99
GPT-5.4-text-only	23.81	27.52	25.51	25.61	11.92	32.64	22.28
Closed-Source MLLMs
GPT-5.4	44.22	36.91	58.16	46.43	26.49	65.97	46.23
GPT-5.2	43.54	39.60	47.96	43.70	26.49	68.06	47.27
GPT-5	62.59	49.66	62.24	58.16	19.87	80.56	50.21
GPT-4.1	29.93	42.95	68.37	47.08	20.53	70.83	45.68
GPT-4o	40.82	42.28	50.00	44.37	31.79	57.64	44.71
Claude-Opus-4.7	53.74	62.42	79.59	65.25	14.57	72.22	43.40
Claude-Sonnet-4.6	37.41	52.35	41.84	43.87	17.88	77.78	47.83
Claude-Sonnet-4.5	34.69	38.93	53.06	42.23	14.57	63.19	38.88
Claude-Haiku-4.5	27.89	26.17	21.43	25.16	17.22	45.83	31.53
Gemini-3.1-Pro	82.31	77.85	96.94	85.70	25.17	80.74	52.95
Gemini-2.5-Pro	65.99	61.07	93.88	73.65	18.54	72.22	45.38
Gemini-2.5-Flash	61.22	69.80	36.73	55.92	25.83	65.49	45.66
Open-Source Multi-Image MLLMs
Qwen3-VL-8B	23.81	17.45	22.45	21.24	22.52	55.56	39.04
Qwen2.5-VL-7B-Ins	18.37	31.54	26.53	25.48	13.91	35.42	24.66
LLaVA-OneVision-7B	38.78	33.56	66.33	46.22	20.53	31.25	25.89
Embodied MLLMs
RoboBrain-2.0-7B	31.97	27.52	31.63	30.37	15.23	40.28	27.75
RoboBrain-2.5-4B	50.34	21.48	72.45	48.09	43.71	46.53	45.12
MiMo-Embodied-7B	51.70	36.91	70.41	53.01	19.21	42.36	30.78

Key Findings from RoboBench Evaluation

Overall Findings

🥇 Large Capability Gaps, yet the Frontier Keeps Advancing

Gemini-3.1-Pro shows the most consistent advantages across perception, affordance, and failure dimensions—67.32 in perception reasoning (vs. the next-best 63.93 of Gemini-2.5-Pro), 85.70 in affordance prediction, and 52.95 in failure analysis—narrowing the gap on selected perception and affordance subsets while remaining uneven across tasks. Most other MLLMs are still highly uneven or generally weak.

🔒 Closed-Source Models Lead by ~20 Points

Closed-source MLLMs lead open-source ones in every dimension, by about 20 points on average (~50% relative)—widest in instruction comprehension (~28) and generalized planning (~25), narrowest in failure analysis (~13). Within the same family, performance improves consistently with model size and generation.
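
As a rough check on a single dimension, the sketch below averages the Perception Reasoning "Avg" column of the leaderboard by model group. Pooling the open-source multi-image and embodied MLLMs into one "open" group is an assumption made here for illustration, and `group_mean` is a hypothetical helper name. The ~20-point figure above averages over all dimensions, so one dimension need not match it exactly.

```python
from statistics import mean

# Perception Reasoning "Avg" values copied from the leaderboard above.
closed_source = {
    "GPT-5.4": 56.20, "GPT-5.2": 53.44, "GPT-5": 60.77, "GPT-4.1": 53.05,
    "GPT-4o": 45.43, "Claude-Opus-4.7": 62.61, "Claude-Sonnet-4.6": 54.09,
    "Claude-Sonnet-4.5": 44.97, "Claude-Haiku-4.5": 35.85,
    "Gemini-3.1-Pro": 67.32, "Gemini-2.5-Pro": 63.93, "Gemini-2.5-Flash": 59.48,
}
open_source = {
    "Qwen3-VL-8B": 41.49, "Qwen2.5-VL-7B-Ins": 32.23, "LLaVA-OneVision-7B": 38.88,
}
embodied = {
    "RoboBrain-2.0-7B": 29.79, "RoboBrain-2.5-4B": 44.04, "MiMo-Embodied-7B": 40.13,
}


def group_mean(scores):
    """Unweighted mean over the models in one group."""
    return mean(scores.values())


closed_avg = group_mean(closed_source)
# Assumption: treat open-source multi-image and embodied MLLMs as one pool.
open_avg = group_mean({**open_source, **embodied})
gap = closed_avg - open_avg

print(f"closed={closed_avg:.2f} open={open_avg:.2f} gap={gap:.2f}")
```

The same pattern applies to any other "Avg" column; only the copied values change.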

🧭 Plans Sound Plausible but Break at Execution

Planning failures are dominated by a perception-action gap: 45% are execution errors from missing or incorrect action sequences, while 24% are identification errors, 25% commonsense or physical-constraint errors, and 6% mode-specific format errors. Current MLLMs often reason plausibly but still fail to produce executable embodied actions.

👀 Vision Is Truly Required, Not Commonsense Recall

A text-only baseline (GPT-5.4 without images) stays close to random on perception (27.12) and affordance (25.61) tasks, far below the best vision-conditioned MLLM (67.32 / 85.70). RoboBench questions demand grounding in the observed scene—visual state, embodiment, and physical feasibility.

Fine-grained Findings

🧠 Implicit Intent Understanding Remains a Major Challenge

Even the strongest explicit-goal model (Claude-Sonnet-4.6) drops from 79.94 to 61.38 when instructions become implicit, and the gap widens for weaker MLLMs. A paired chain-of-thought rewriting ablation shows this is a genuine intent-grounding limitation, not a prompting artifact.

👁️ Perception Bottlenecks: Embodiment and Time

Models handle object attributes well (up to 82.28 on functional attributes) but struggle with robotic perception and spatiotemporal reasoning: the best scores are only 53.62 on robot-view understanding and 50.38 on temporal grounding—the two weakest perception tasks. Stronger embodiment-aware perception and explicit spatiotemporal reasoning are needed.

🧩 Planning Limitations Persist

Cross-embodiment: models trained mostly on single-arm settings fail to coordinate dual-arm actions or mobile manipulation. Cross-object: performance drops sharply on uncommon objects, symbolic reasoning, and world knowledge. Cross-view: multi-view inputs effectively recover performance when the front view is occluded, underscoring the value of multi-view reasoning.

⚙️ Execution-Failure Diagnosis Is Extremely Hard

Diagnosing execution-level errors is far harder than diagnosing planning-level ones: the best model reaches only 43.71 (most fall between 15 and 27), while planning-error diagnosis reaches 80.74; the human reference shows the same asymmetry (47.30 vs. 80.67). Execution-failure diagnosis requires fine-grained spatial and physical understanding, e.g., separating position errors from rotation errors.

In-depth Analysis: VLM-VLA Association

To probe how RoboBench scores relate to downstream robot-control evaluation, we convert several open-source VLM backbones into VLA policies with minimal fine-tuning and evaluate them on CALVIN and LIBERO-10.

In our 8-backbone analysis, RoboBench perception scores show strong associations with long-horizon CALVIN performance: object-centric perception reaches r=0.884 and scene-centric perception reaches r=0.833. For LIBERO-10, the strongest positive trend shifts toward fine-grained interaction dynamics, where static+dynamic affordance prediction reaches r=0.677.
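
The page does not name the statistic behind r; assuming it is the Pearson correlation coefficient, a minimal sketch of the computation looks as follows. The two score lists are placeholders for eight hypothetical backbones and are not numbers from the paper; `pearson_r` is an illustrative helper, not the authors' analysis code.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# PLACEHOLDER values for 8 hypothetical backbones, NOT results from the paper.
benchmark_scores = [30.0, 35.0, 41.0, 44.0, 48.0, 52.0, 55.0, 60.0]
downstream_scores = [1.1, 1.6, 1.5, 2.2, 2.0, 2.9, 2.7, 3.4]

r = pearson_r(benchmark_scores, downstream_scores)
print(f"r = {r:.3f}")
```

With only eight points, a coefficient like this indicates an association rather than a causal link, which matches the cautious wording used in this section.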

These observed associations suggest that different VLA benchmarks emphasize different cognitive skills. RoboBench therefore serves not only as a leaderboard, but also as a diagnostic tool for identifying which VLM capabilities may be relevant to downstream embodied policies.

Dataset Construction Pipeline

Dataset Construction Pipeline. RoboBench integrates open-source and self-collected robot data under a shared process (preprocessing → tool-assisted + human-in-the-loop annotation → unified schema → auto-generated QA) and builds datasets for five dimensions:
Instruction Comprehension: pair explicit instructions with LLM-rewritten implicit variants to test intent understanding.
Perception Reasoning: use captioning/detection/segmentation tools to draft labels across robotic/object/scene/task views, then human-refine and standardize.
Generalized Planning: construct a planning pool from robot videos; VLMs produce step/timestamp summaries and metadata, which are mapped to function templates to support Q1/Q2/Q3 evaluations.
Affordance Prediction: sample key frames and annotate static (contact points), dynamic (trajectories), and mobile (base positions) affordances.
Failure Analysis: mine execution-level failures from real trials and synthesize planning-level errors by perturbing correct instructions.
All outputs follow one schema and are rendered into binary, single-choice, and multi-step multiple-choice QA formats for open- and closed-source MLLMs.
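
To illustrate what "one schema rendered into QA formats" could look like, here is a minimal sketch of a single-choice item. Every field name, the example question, and the helpers `render_single_choice` and `is_correct` are hypothetical; the released dataset may use a different layout.

```python
from dataclasses import dataclass
from string import ascii_uppercase

# The five dimension names come from the page; everything else is hypothetical.
DIMENSIONS = {
    "Instruction Comprehension",
    "Perception Reasoning",
    "Generalized Planning",
    "Affordance Prediction",
    "Failure Analysis",
}


@dataclass
class QAItem:
    """One single-choice question (hypothetical schema)."""
    dimension: str
    question: str
    options: list[str]
    answer_index: int

    def __post_init__(self):
        if self.dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {self.dimension}")
        if not 0 <= self.answer_index < len(self.options):
            raise ValueError("answer_index out of range")


def render_single_choice(item):
    """Render an item as a lettered multiple-choice prompt."""
    lines = [item.question]
    for letter, option in zip(ascii_uppercase, item.options):
        lines.append(f"{letter}. {option}")
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


def is_correct(item, reply):
    """Compare the first letter of a model reply with the gold option."""
    return reply.strip().upper()[:1] == ascii_uppercase[item.answer_index]


# Made-up example item, for illustration only.
item = QAItem(
    dimension="Affordance Prediction",
    question="Where should the gripper make contact to lift the mug?",
    options=["handle", "rim", "base"],
    answer_index=0,
)
print(render_single_choice(item))
```

Keeping the gold answer as an index rather than a letter lets the same item be re-rendered with shuffled options or as a binary question.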

Planning Evaluation Pipeline

Planning Evaluation Framework. The planning dimension is evaluated with three question types (Q1–Q3). Each task is decomposed into a sequence of parameterized atomic actions forming a Directed Acyclic Graph (DAG) that encodes causal and temporal dependencies. For Q1 (Long-horizon planning), an MLLM-based world simulator assesses both NodeCorrectness (action alignment) and TaskCompletion (goal-state achievement) by simulating action rollouts under visual and physical constraints. Q2 (Next-step planning) evaluates fine-grained step prediction by comparing skill, object, and parameter accuracy, while Q3 (Task state estimation) measures binary correctness on whether a subtask has been completed. Together, the pipeline provides a unified, interpretable framework for assessing structural correctness and embodied feasibility in planning.
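
To make the DAG idea concrete, the sketch below encodes a small reference plan as atomic actions with prerequisite edges, checks whether a predicted action sequence covers every node in a dependency-respecting order, and scores a next-step prediction field by field. It is a purely symbolic stand-in: the benchmark uses an MLLM world simulator, which this code does not reproduce. The `Action` fields, the example plan, and both function names are hypothetical, and the equal weighting of the three fields is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A parameterized atomic action (field names are hypothetical)."""
    skill: str
    obj: str
    param: str = ""


# Hypothetical reference plan: node id -> action.
nodes = {
    "a": Action("pick", "cup"),
    "b": Action("open", "drawer"),
    "c": Action("place", "cup", "in drawer"),
}
# Prerequisites: "c" needs both "a" and "b"; "a" and "b" may come in any order.
deps = {"a": set(), "b": set(), "c": {"a", "b"}}


def respects_dag(predicted, nodes, deps):
    """True if the predicted actions cover every node with prerequisites first.

    Any action that matches no pending, unblocked node makes the plan fail.
    """
    done = set()
    for action in predicted:
        match = next(
            (n for n, ref in nodes.items()
             if n not in done and ref == action and deps[n] <= done),
            None,
        )
        if match is None:
            return False
        done.add(match)
    return done == set(nodes)


def next_step_score(pred, ref):
    """Fraction of (skill, object, parameter) fields that match exactly."""
    fields = [(pred.skill, ref.skill), (pred.obj, ref.obj), (pred.param, ref.param)]
    return sum(p == r for p, r in fields) / len(fields)


valid_plan = [nodes["b"], nodes["a"], nodes["c"]]
invalid_plan = [nodes["a"], nodes["c"], nodes["b"]]  # places cup before opening drawer
print(respects_dag(valid_plan, nodes, deps), respects_dag(invalid_plan, nodes, deps))
```

Exact matching like this is the kind of symbolic check the framework is described as moving beyond; it is shown only to clarify what the DAG dependencies constrain.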

BibTeX

@misc{luo2026robobenchcomprehensiveevaluationbenchmark,
      title={Robobench: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models as Embodied Brain},
      author={Yulin Luo and Chun-Kai Fan and Menghang Dong and Jiayu Shi and Xiangju Mi and Mengdi Zhao and Bo-Wen Zhang and Cheng Chi and Jiaming Liu and Gaole Dai and Rongyu Zhang and Ruichuan An and Kun Wu and Zhengping Che and Shaoxuan Xie and Guocai Yao and Zhongxia Zhao and Pengwei Wang and Guang Liu and Zhongyuan Wang and Tiejun Huang and Shanghang Zhang},
      year={2026},
      eprint={2510.17801},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2510.17801},
}